Using K-Means Clustering to Determine Groupings of Fatal Car Crashes
Kyle Knuth
Siva Gopavarapu
Nikhitha Karlapudi
Eswar Sai Burugupalley
Bala Venkataprasad Sadhana Vittal
2024-04-11
Intro to K-Means Clustering (an Unsupervised Technique)
K-means clustering groups data into clusters that maximize within-cluster similarity and minimize between-cluster similarity.
Methods for choosing the optimal number of clusters: Elbow, Silhouette, and Gap Statistic.
Preprocessing techniques such as the Canopy algorithm help with large datasets.
Cluster quality assessed using Calinski-Harabasz criterion.
K-means process: iterative centroid recalculation for cluster convergence.
Efficiency improved by using PCA for initial centroid estimation.
Applications
K-means clustering applied in various fields:
Color segmentation in fruit disease identification.
Geographic clustering for aquifer risk assessment and regional classification in Indonesia.
Medical applications: predicting chemotherapy response, analyzing skin changes with age, pregnancy problem analysis, and allergen sensitivity prediction.
Tourism industry: clustering tourist preferences to enhance satisfaction.
Agriculture: soil classification and spatial representation improvement.
Economic and health behavior clustering for county-level analysis.
Variations:
Constrained K-means clustering imposes lower bounds on cluster sizes and on the distances between clusters, potentially deleting or consolidating clusters based on those constraints.
Improved K-means enhances speed and efficiency by using two data structures to store cluster labels and distances, reducing repeated distance calculations.
The filtering algorithm optimizes Lloyd's K-means using a kd-tree structure, improving efficiency and scalability to higher dimensions.
K-means can be integrated with Hadoop's MapReduce for efficient large-scale processing: distances between centroids and data points are computed in parallel, and centroids are iteratively updated until convergence.
Extended K-means considers cluster sizes by sorting clusters based on the number of elements and recruits or drops elements from clusters accordingly.
Limitations:
Random initialization of centroids.
Requires prior knowledge of the number of clusters.
Difficulty handling diverse data types.
Choosing incorrect number of clusters can introduce variability.
Issues arise with complex data and scalability.
Requires careful preprocessing and adequate sample size.
Methods
The methodology behind k-means clustering involves preprocessing the data, picking variables of interest, determining the optimal number of clusters, performing the clustering, and characterizing what each resulting cluster consists of. The data is typically preprocessed by first removing any categorical variables, then removing outliers, and finally performing feature scaling. Once the data has been preprocessed, the variables of interest are selected and the optimal number of clusters is determined.
Silhouette Coefficient
The first method for determining how many clusters to use is the Silhouette Coefficient. The Silhouette Coefficient measures how similar a data point is to its own cluster compared to other clusters, on a scale from -1 to 1, where a value closer to 1 means the point is well placed within its cluster [@shi2021quantitative].
\[
S=\frac{b-a}{\max(b,\:a)}
\]
S: Silhouette coefficient
a: the average distance from a data point to the other points in its own cluster
b: the lowest average distance from that point to the points of any other cluster
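As a small worked example (toy values, not the crash dataset), the coefficient for a single point can be computed by hand in base R; the cluster package's silhouette() function automates this for every point:

```r
# Toy one-dimensional data with an obvious two-group structure
x  <- c(1, 2, 10, 11)
cl <- c(1, 1, 2, 2)               # cluster assignment of each point
d  <- as.matrix(dist(x))          # pairwise Euclidean distances

# Silhouette for point 1: S = (b - a) / max(b, a)
a <- mean(d[1, cl == 1][-1])      # average distance within its own cluster
b <- mean(d[1, cl == 2])          # average distance to the other cluster
s <- (b - a) / max(b, a)          # close to 1: point 1 is well placed
```

Here a = 1 and b = 9.5, giving a silhouette near 0.89, reflecting that point 1 clearly belongs with its neighbor.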
Elbow Method
The next method of determining how many clusters to use is the Elbow method. The Elbow method consists of plotting the within-cluster sum of squares against the number of clusters and visually selecting a point where the rate of decrease significantly slows down [@shi2021quantitative].
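A minimal base-R sketch of the idea on simulated two-group data (the fviz_nbclust() call used later automates this plot):

```r
set.seed(42)
# Two simulated groups of points in the plane
x <- rbind(matrix(rnorm(50, mean = 0), ncol = 2),
           matrix(rnorm(50, mean = 5), ncol = 2))

# Total within-cluster sum of squares for k = 1..6
wss <- sapply(1:6, function(k) kmeans(x, centers = k, nstart = 10)$tot.withinss)
plot(1:6, wss, type = "b", xlab = "Number of clusters k",
     ylab = "Within-cluster sum of squares")
# The curve drops sharply up to the true k and flattens afterwards
```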
Gap Statistic
The final method for determining how many clusters to use is the Gap statistic. This method finds the optimal number of clusters by comparing the observed within-cluster dispersion to its expectation under a null reference distribution [@tibshirani2001estimating].
\[
Gap_n(k)=E^*_n\{\log(W_k)\}-\log(W_k)
\]
\(Gap_n(k)\): Gap statistic
n: sample size
k: number of clusters
\(W_k\): within-cluster sum of squares about the cluster means
\(E^*_n\): expectation under a reference (null) distribution for a sample of size n
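The cluster package (shipped with R) implements this directly via clusGap(); a sketch on simulated data:

```r
library(cluster)            # provides clusGap() and maxSE()
set.seed(42)
x <- rbind(matrix(rnorm(50, mean = 0), ncol = 2),
           matrix(rnorm(50, mean = 5), ncol = 2))

# B bootstrap samples from a uniform null reference distribution
gap <- clusGap(x, FUN = kmeans, nstart = 10, K.max = 6, B = 50)

# Pick the smallest k whose gap is within one standard error of the max
k_opt <- maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"])
```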
Euclidean Distance
The distance metric used for these methods is Euclidean distance. Euclidean distance is defined as the straight line distance between two points [@ultsch2022euclidean].
\[
d(x,y)=\sqrt{\sum_{i=1}^{n}(x_i-y_i)^2}
\]
d(x,y): Euclidean distance between points x and y
x, y: two data points (vectors)
\(x_i,\:y_i\): the i-th coordinates of x and y
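The formula translates directly into a one-line helper (base R's dist() computes the same pairwise distances):

```r
# Straight-line distance between two points, per the formula above
euclid <- function(x, y) sqrt(sum((x - y)^2))

euclid(c(0, 0), c(3, 4))   # 5: the classic 3-4-5 right triangle
```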
K-Means process
The actual method of k-means clustering is an iterative process consisting of the following steps [@9320839]:
Determine the number of clusters
Randomly select k initial centroids
Using Euclidean distance, assign each data point to the cluster of its nearest centroid
Recalculate each centroid as the mean of the points assigned to its cluster
Repeat the previous two steps until the centroids no longer change
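Base R's kmeans() carries out this iterative process (by default with the Hartigan-Wong variant of the algorithm); a minimal sketch on simulated two-group data:

```r
set.seed(1)
# 20 points near (0, 0) and 20 points near (6, 6)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 6), ncol = 2))

# kmeans() runs the loop above: random start, assign points to the
# nearest centroid, recompute centroids, repeat until convergence
fit <- kmeans(x, centers = 2, nstart = 25)
fit$centers        # final centroids, approximately (0, 0) and (6, 6)
fit$tot.withinss   # within-cluster sum of squares at convergence
```

Setting nstart > 1 reruns the algorithm from several random initializations and keeps the best result, which mitigates the random-initialization limitation noted earlier.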
Assumptions
The assumptions of k-means clustering are as follows [@raykov2016k]:
The data forms a sphere around each centroid, and every sphere has a similar radius
The centroid is the average of the points in its cluster, which means outliers will dramatically affect the clustering
The “Fatalities” dataset comes from the AER R package and originated with the US Department of Transportation's Fatal Accident Reporting System; it covers 1982-1988 and includes every state except Alaska and Hawaii.
The five graphs below show a boxplot of each variable of interest. As seen below there are some outliers that need to be removed so we can cluster our data effectively.
boxplot(df$afatal, horizontal = TRUE, xlab = "Count", ylab = "Fatal Car Crashes (Alcohol)")
First, we check for outliers in our data and remove them before scaling, since outliers can distort the centroids in the k-means clustering process. The data also needs to be scaled because k-means relies on Euclidean distance, and large differences in variable magnitudes can adversely affect it.
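As a sketch of that preprocessing on a toy data frame (the column name afatal matches the dataset, but the values and the 1.5 × IQR fence rule are illustrative assumptions, not taken from the report):

```r
set.seed(7)
df <- data.frame(afatal = c(rnorm(20, mean = 100, sd = 10), 500))  # 500 is an outlier

# Drop rows falling outside the 1.5 * IQR fences for a given column
remove_outliers <- function(data, col) {
  q   <- quantile(data[[col]], c(0.25, 0.75))
  iqr <- q[2] - q[1]
  keep <- data[[col]] >= q[1] - 1.5 * iqr & data[[col]] <= q[2] + 1.5 * iqr
  data[keep, , drop = FALSE]
}

cleaned <- remove_outliers(df, "afatal")
scaled  <- scale(cleaned)   # z-score each column before clustering
```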
The elbow method is performed and as shown by the graph, the optimal number of clusters is 6.
fviz_nbclust(cleaned_df2, kmeans, method = "wss")
According to this graph, the optimal number of clusters could be either 10 or 6; these values give the highest average silhouette widths (closest to 1).
Lastly, the Gap statistic is used as a final check on the best number of clusters. This graph also shows 6 clusters as optimal.
Cluster Analysis (average)
Cluster 1: This cluster is characterized by an average consumption of 1.47 alcoholic drinks per person, with 323 alcohol-related car crashes on average. The average beer tax is relatively high at 93 cents. It predominantly comprises Baptists rather than Mormons.
Cluster 2: Cluster 2 exhibits a slightly higher average consumption of alcoholic drinks per person (1.72) than Cluster 1, with a lower average of 117 alcohol-related car crashes. The average beer tax is notably lower at 23.7 cents. This cluster consists of somewhat more Mormons than Baptists.
Cluster 3: In Cluster 3, the average consumption of alcoholic drinks per person is 1.39, with 113 alcohol-related car crashes on average. The average beer tax falls between the previous clusters at 50.1 cents. This cluster has similar proportions of Baptists and Mormons.
Cluster 4: This cluster demonstrates the highest average consumption of alcoholic drinks per person (2.18) among the clusters, with 142 alcohol-related car crashes on average. The average beer tax is similar to Cluster 2 at 24.5 cents. Like Cluster 3, it contains similar proportions of Baptists and Mormons.
Cluster 5: Cluster 5 has an average consumption of 1.6 alcoholic drinks per person, with 473 alcohol-related car crashes on average. The average beer tax is 27 cents. This cluster contains more Mormons than Baptists.
Cluster 6: Finally, Cluster 6 displays the lowest average consumption of alcoholic drinks per person (1.24), with 295 alcohol-related car crashes on average. The average beer tax is 35 cents. Notably, this cluster contains more Baptists than Mormons.
Cluster analysis (by variable)
In this first chart, cluster 5 had the highest number of alcohol-related car crash fatalities, while clusters 2 and 3 had the lowest; cluster 1 had the second highest.
In the next chart, cluster 4 had the highest alcohol consumption, while cluster 6 had the lowest; the second highest was cluster 2.
As shown below, clusters 1 and 6 had the highest populations of Baptists, while clusters 3 and 4 had the lowest.
In this chart, cluster 2 had the highest population of Mormons, and clusters 3 and 4 had the lowest.
Cluster 1 had the highest beer tax, while clusters 2 and 4 had the lowest.
Cluster analysis (state)
The graphs below show the same information as the bar charts but at the state level, highlighting which states belong to which cluster. We created a subset of the original data containing the cluster assignments from our k-means analysis to make these inferences.
Below is the map in which states are divided based on clusters
Conclusion
The analysis of the “Fatalities” dataset, sourced from the US Department of Transportation’s Fatal Accident Reporting System, was conducted to understand fatal car crashes.
K-means clustering was utilized as an effective method for grouping data to extract insights.
The analysis process involved data pre-processing, determining the optimal number of clusters, and implementing the K-means clustering model.
Six clusters were generated to elucidate trends related to alcohol-related car crashes, considering factors such as alcohol consumption, religion, and beer tax.
The clusters helped to create subregions within the United States, each exhibiting distinct characteristics and providing insights into alcohol handling and its impacts.
Insights derived from the analysis can aid in identifying potential areas for targeted interventions regarding alcohol-related issues.
Selecting the appropriate number of clusters involves some subjectivity, which introduces potential variability in the results and their accuracy.
The decision on the optimal number of clusters was guided by the Elbow and Silhouette methods, and it became clearer after observing the Gap statistic.